Abstract. Modern text retrieval systems often provide a similarity search utility that allows the user
to efficiently find a fixed number k of documents in the data set that are most similar to a given query
(here a query is either a simple sequence of keywords or the identifier of a full document found in previous
searches that is considered of interest). We consider the case of a textual database made of semi-structured
documents. For example, in a corpus of bibliographic records any record may be structured into three
fields: title, authors and abstract, where each field is an unstructured free text. Each field, in turn, is
modelled with a specific vector space. The problem is more complex when we also allow each such vector
space to have an associated user-defined dynamic weight that influences its contribution to the overall
dynamic aggregated and weighted similarity. This dynamic problem has been tackled in a recent paper by
Singitham et al. [18] in VLDB 2004. Their proposed solution, which we take as baseline, is a variant
of the cluster-pruning technique that has the potential for scaling to very large corpora of documents,
and is far more efficient than the naive exhaustive search. We devise an alternative way of embedding
weights in the data structure, coupled with a non-trivial application of a clustering algorithm based on the
furthest point first heuristic for the metric k-center problem. The validity of our approach is demonstrated
experimentally by showing significant performance improvements over the scheme proposed in [18]. We
significantly improve the tradeoffs between query time and output quality with respect to the baseline method
in [18], and also with respect to a novel method by Chierichetti et al. to appear in ACM PODS 2007 [3].
We also speed up the pre-processing time by a factor of at least thirty.

1 Introduction

Singitham et al. [18] consider the following problem: given a set $S$ of $s$ sources of evidence
and a set $E$ of $n$ records, they define for each record $e^j \in E$ and each source $S_i \in S$ a source
score $\sigma_i(e^j)$; moreover, for each source $S_i$ there is a positive scalar weight $w_i$ that is user-defined
and changes dynamically for each query. The dynamic aggregated score of $e^j$ is $\sum_{i=1}^{s} w_i \sigma_i(e^j)$.
The Dynamic Vector Score Aggregation problem is to find the k elements in E with the highest
dynamic aggregate score. The authors note that in the absence of any further structure the only
solution is an exhaustive computation of the aggregate score for all the elements in E, followed by
the determination of the k highest elements in the ranking induced by the aggregate score.
Therefore they consider the special case when each feature of the records is actually a vector,
and the source score function $\sigma_i(e^j)$ is a geometric distance function measuring the distance
of $e^j$ to a query point q (equivalently, one can define a dual similarity function for the same
purpose).
They also observe that if s = 1 and the source score is a geometric proximity function (e.g. a
metric) to a query point, then this problem reduces to the classical k-nearest neighbor problem.
The difficulty in handling the k-nearest-neighbor problem in the general case of a linear
combination of $s \geq 2$ geometric proximity functions stems from the need to combine the scores from
generally unrelated sources, compounded with the presence of arbitrary positive weights. In [18]
the Vector Score Aggregation problem is solved by extending the cluster pruning technique for
the geometric k-nearest-neighbor problem.
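
The exhaustive baseline mentioned above can be sketched in a few lines. This is only an illustration under our own assumptions: the function names and the toy per-source scores are hypothetical and are not taken from [18]; scores are treated as similarities, so higher is better.

```python
import heapq

def aggregate_score(weights, source_scores):
    """Dynamic aggregated score: sum over sources of w_i * sigma_i(e)."""
    return sum(w * s for w, s in zip(weights, source_scores))

def top_k_exhaustive(records, weights, k):
    """records maps a record id to its list of per-source scores.

    Every record is scored, so the cost is linear in the number of records.
    """
    return heapq.nlargest(
        k, records, key=lambda rid: aggregate_score(weights, records[rid])
    )

# Hypothetical toy data: three records, three sources of evidence.
records = {"a": [0.9, 0.1, 0.2], "b": [0.2, 0.8, 0.7], "c": [0.5, 0.5, 0.5]}
first = top_k_exhaustive(records, [0.6, 0.2, 0.2], 2)
second = top_k_exhaustive(records, [0.2, 0.2, 0.6], 2)
```

Changing only the weights changes the answer set, which is why an index built for one fixed weighting cannot simply be reused.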



1.1 Our contribution 

In this paper we use the cluster pruning approach, but we derive a new and much simpler way
of (not) embedding dynamic weights in vector cluster pruning similarity searches. Moreover, by
using different clustering strategies and techniques we obtain further benefits. In particular
we will describe alternatives for the following key aspects:

(a) How weights are embedded in the scheme. In the general Vector Score Aggregation
problem the user supplies a query (this can be either a document in the database or a collection
of keywords that capture the concept being searched for) and a weight-vector that expresses the
user's perception of the relative importance of the document features in capturing the informal
notion of "similarity". We show in Section 4 that, surprisingly, one need not be concerned with
dynamic weights at all during pre-processing: the solution for the unweighted case is good also
for the weighted one.

(b) Multiple clusterings. In cluster pruning search one decides beforehand to visit a certain 
number of clusters whose "leaders" are closest to the query point. However, there is a hidden 
law of diminishing returns: clusters further away from the query are less likely to contain good 
k-neighbors. We use a different strategy: we form not one but several (three in our experiments)
independent clusterings and we search all of them, looking into fewer clusters
in each clustering.

(c) The ground clustering algorithm. When searching for nearest neighbors of a query
point q it is natural to consider a cluster good for such a search when its diameter is small. This
leads to considering the optimal K-center problem (i.e. finding a decomposition minimizing the
maximum diameter of any cluster produced) as a better objective to attain with respect to
other conceivable objectives. Thus we are led to consider the Furthest-Point-First heuristic,
which is 2-competitive for this problem [15]. We attain two benefits: (1) the quality of the output is
increased, as demonstrated by the experiments in Section 7; (2) preprocessing time is reduced
by orders of magnitude, since we can use fast variants of this algorithm (see e.g. [12, 11]).

By introducing these three variations we significantly outperform state of the art algorithms 
for this problem. 



1.2 Experimental results 

We have run our algorithm against two baselines. The first baseline is the algorithm in [18]
that uses k-means clustering. The second baseline is the algorithm in [18] modified so as to use
a simple randomized cluster pruning strategy proposed in a forthcoming paper by Chierichetti
et al. [3]. We perform tests on data sets of 50K and 100K documents using a variety of weights
and randomly chosen query documents. Figure 2 shows the query time/recall tradeoff of the
three methods. Our method is clearly dominant, giving consistently better quality results in less
time. Quality data are also given in tabular form in Table 2.

The top portion of Table 2 corresponds to the case of equal weights, which is equivalent to the
unweighted case, and already our method shows better time/quality tradeoffs than both
baselines. In the entries of Table 2 for unequal weights our scheme is vastly superior in recall,
even doubling the number of true k-nearest neighbors found while using less time than both baselines.
The overall quality of the retrieved nearest neighbors, as measured by the normalized aggregated
goodness, is also improved: this indicates that our method is robust and stable relative to the
baselines.

The simpler clustering strategy in [3] has preprocessing time close to ours, but quality/cost
performance inferior to our scheme and to that in [18]. The improvement in preprocessing
time is also noteworthy: we gain a factor of 30 against [18] in a test with 100,000 documents. In practice
we could complete the preprocessing in one day, compared to the one month required by [18].

1.3 Organization of the paper 

This paper is organized as follows. In Section (2) we give a brief review of the state of the
art methods most relevant to our setting, while a more extended survey is postponed to the
full paper. In Section (3) we review known properties of the cosine similarity/distance metric.
In Section (4) we show the main theoretical analysis underpinning our weight embedding
technique. In Section (5) we describe and compare the algorithm that uses our new weight
embedding scheme, and the scheme proposed in [18]. In Section (6) we describe how the output
quality is measured. In Section (7) we give the experimental set up and the experimental
results. Conclusions and future work are in Section (8).

2 Brief state of the Art 

There is a vast literature on similarity searching and k-nearest neighbor problems (see extended 
surveys in [16, 2]). However, much less is known for the case when users are allowed to change 
the underlying metric dynamically at query time. Besides the work of [18] we mention work
by P. Ciaccia and M. Patella [4] discussing which general relations should hold between two
metrics A and B so that one can build a data structure using the first metric (A), but perform
searches according to the second one (B).

A series of papers by R. Fagin and co-authors [6, 8, 10, 9] deal with the problem of rank score
aggregation in a general setting in which items are ranked independently according to several
sources (not necessarily due to geometric distance functions) and one seeks to find efficiently
the best combined ranking. The same problem but in a distributed setting is discussed in [17].
In these papers the rankings are assumed to be statically available and the issue is only how
to combine the rankings efficiently. Equivalently, in their model the cost for producing the
independent rankings is not accounted for. Our setting is different since the total search cost
is considered.

Many other schemes are known for the classical (unweighted, s = 1) k-nearest neighbor
computation that, however, are mostly useful in low dimensional spaces (or in high dimensional
spaces with dense vectors). Applications in text retrieval are characterized by very high
dimensionality of the corresponding vector space (in the region of tens of thousands) and sparse
vector representations. In general, sophisticated tree-based schemes are ineffective on such data
sets. For example, the tree-based algorithm M^3-tree proposed by Bustos and Skopal [1] has been
tested on a data set in dimension 89. The rank-aggregation method of R. Fagin, R. Kumar and
D. Sivakumar [7] has been tested on data sets in dimension 800, while the hashing based method
of A. Gionis, P. Indyk and R. Motwani [13] has been tested on data sets in dimension 64. The
p-sphere method [14] is used in [3] as baseline since it has been shown to be superior to a series
of other data structures proposed in the literature. Experiments in [3] found that their random
clustering method performs better than the p-sphere method on textual data.

3 Metric Spaces and Cosine Similarity 

In this paper we will mostly use distance measures, therefore we will convert all results and
algorithms for similarity measures into distances. As noted in [5], the inner product of two vectors
$x$ and $y$ of length 1 (in norm 2), that is, the standard cosine similarity of two normalized vectors, is
turned into a distance $d(x,y) = 1 - x \cdot y$. This distance function is not a metric in a strict sense
since the triangular inequality is not satisfied; however, the derivation
$\|x - y\|^2 = x \cdot x + y \cdot y - 2 x \cdot y = 2(1 - x \cdot y) = 2 d(x,y)$
shows that the square root of the distance is indeed a metric.
Equivalently, one can say that it satisfies the extended triangular inequality
$d(x,y)^{\alpha} + d(y,z)^{\alpha} \geq d(x,z)^{\alpha}$ with parameter $\alpha = 1/2$. Moreover, a linear combination of distance functions with
positive weights defined on the same space is still a metric: $D(x,y) = \sum_i w_i d_i(x,y)$ for
$w_i > 0$. Thus the aggregate vector score function used in [18], although not giving rise to a
metric in a strict sense, is nonetheless closely related to a metric space.
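
The two facts used in this section can be checked numerically. The following is a minimal sketch, using only the Python standard library; the helper names and the random test vectors are our own and serve purely as illustration. It exhibits three unit vectors for which d = 1 - x·y violates the triangular inequality, and then checks on random unit vectors that ||x - y||^2 = 2 d(x, y) and that the square root of d does satisfy the triangular inequality.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cos_dist(x, y):
    """d(x, y) = 1 - x.y for unit vectors x and y."""
    return 1.0 - dot(x, y)

def sq_euclid(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Unit vectors at 0, 45 and 90 degrees: d(x,y) + d(y,z) is about 0.586 < d(x,z) = 1.
x, y, z = [1.0, 0.0], normalize([1.0, 1.0]), [0.0, 1.0]
plain_triangle_holds = cos_dist(x, y) + cos_dist(y, z) >= cos_dist(x, z)

rng = random.Random(0)
identity_ok = True
sqrt_triangle_ok = True
for _ in range(1000):
    a, b, c = (normalize([rng.gauss(0, 1) for _ in range(5)]) for _ in range(3))
    # ||a - b||^2 = 2 d(a, b) up to floating-point error
    identity_ok &= abs(sq_euclid(a, b) - 2 * cos_dist(a, b)) < 1e-9
    # max(..., 0.0) guards against tiny negative values caused by rounding
    lhs = math.sqrt(max(cos_dist(a, b), 0.0)) + math.sqrt(max(cos_dist(b, c), 0.0))
    sqrt_triangle_ok &= lhs + 1e-9 >= math.sqrt(max(cos_dist(a, c), 0.0))
```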

4 Simplification of the vector score aggregation problem 

In [18] the queries are of the form $q = (q_1, \ldots, q_s)$ where each $q_i$ is a vector of unit length; moreover
the user supplies a weight vector $w = (w_1, \ldots, w_s)$ where each $w_i$ is a positive scalar weight, and
the weights sum to 1. An element of the input set $E$ is of the form $e^j = (e_1^j, \ldots, e_s^j)$ where each
$e_i^j$ is a vector of unit length. The aggregate distance function is $d_{AD}(q, e^j) = 1 - \sum_i w_i (q_i \cdot e_i^j)$,
while the aggregate similarity is $s_{AD}(q, e^j) = 1 - d_{AD}(q, e^j) = \sum_i w_i (q_i \cdot e_i^j)$.



Linearity. One should notice that, because of the linearity of the summation and inner
product operators, the weights can be associated with the vector space:
$\sum_i w_i (q_i \cdot e_i^j) = \sum_i q_i \cdot w_i e_i^j = q \cdot w e^j$.
This association has been chosen in [18]; thus the challenge arises from the fact that one
has to do pre-processing without knowing the real weights, which are supplied on-line at query
time.

A different aggregation. Let $q$ be a query point, $c$ a center of cluster $C(c)$, $p$ a point in
cluster $C(c)$, and $D$ a distance function that satisfies the extended triangular inequality with
parameter $\alpha$. The effectiveness of clustering search stems from the observation that the distance
$D(q,p)$ is bounded by an increasing function of $D(q,c)$ and $D(c,p)$. Moreover, when $p \in C(c)$,
the distance $D(c,p)$ has the smallest value over all centers in the clustering. Thus using the
center $c$ closest to $q$ gives us the best possible upper estimate of the distance $D(q,p)$. We have:

$$D(q,p) \leq \left( D(q,c)^{\alpha} + D(c,p)^{\alpha} \right)^{1/\alpha}$$
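
As a worked instance of this bound, take the cosine distance of Section 3, for which the extended triangular inequality holds with parameter 1/2. The numbers below are illustrative only and do not come from the experiments.

```latex
D(q,p) \le \left(\sqrt{D(q,c)} + \sqrt{D(c,p)}\right)^{2}
% illustrative values: D(q,c) = 0.04, D(c,p) = 0.09
\left(\sqrt{0.04} + \sqrt{0.09}\right)^{2} = (0.2 + 0.3)^{2} = 0.25
```

So a point p within distance 0.09 of a center that lies at distance 0.04 from q is guaranteed to be within 0.25 of q. The bound is weaker than the plain sum 0.13 that a true metric would give, but it is still an increasing function of both terms, which is all the pruning argument needs.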

Consider now the weighted similarity WS:

$$WS(w,q,p) = \sum_i w_i (p_i \cdot q_i) = \sum_i (w_i q_i) \cdot p_i = Q_w \cdot p,$$

where $Q_w = [w_1 q_1, \ldots, w_s q_s]$ is the weighted query vector of vectors. Since the linear combination
of weights and queries might not result in a unit length vector, we perform a normalization
(depending only on the weights and the query point) and obtain a normalized weighted distance
NWD:

$$NWD(w,q,p) = 1 - \frac{WS(w,q,p)}{|Q_w|} = 1 - \frac{Q_w}{|Q_w|} \cdot p = D(Q'_w, p),$$

where $Q'_w = Q_w / |Q_w|$ is the normalized weighted query vector of vectors. Now we are in a
position to use the above generalized triangular inequality and establish that:

$$NWD(w,q,p) = D(Q'_w, p) \leq \left( D(Q'_w, c)^{\alpha} + D(c,p)^{\alpha} \right)^{1/\alpha}.$$

Since $D(c,p)$ is independent of the pair $q, w$, we can do at preprocessing time a clustering based
on the input set $E$ and the distance $D(\cdot,\cdot)$, regardless of weights and queries. At query time
we can compute $D(Q'_w, c)$ and combine this value with $D(c,p)$ to get the upper estimate of
$NWD(w,q,p)$ that guides the search. The conclusion of this discussion is that, using cosine
similarity, the multi-dimensional weighted case can be reduced to a mono-dimensional (i.e.
unweighted) case, for which we have good data structures.
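
The folding of the weights into the query can be sketched as follows. This is a minimal illustration under our own assumptions: fields are small dense lists rather than the sparse tf-idf vectors used in the experiments, the vector of vectors is represented as a plain concatenation, and all names and toy values are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def flatten(fields):
    """Concatenate the per-field vectors into one long vector."""
    return [x for field in fields for x in field]

def weighted_similarity(weights, q_fields, p_fields):
    """WS(w, q, p) = sum_i w_i (p_i . q_i)."""
    return sum(w * dot(q, p) for w, q, p in zip(weights, q_fields, p_fields))

def normalized_weighted_query(weights, q_fields):
    """Return Q'_w = Q_w / |Q_w| together with |Q_w|."""
    q_w = [w * x for w, field in zip(weights, q_fields) for x in field]
    norm = math.sqrt(dot(q_w, q_w))
    return [x / norm for x in q_w], norm

def nwd(weights, q_fields, p_fields):
    """NWD(w, q, p) = 1 - Q'_w . p; the stored point p is never re-weighted."""
    q_unit, _ = normalized_weighted_query(weights, q_fields)
    return 1.0 - dot(q_unit, flatten(p_fields))

weights = [0.6, 0.2, 0.2]
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
points = {
    "a": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "b": [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
    "c": [[0.6, 0.8], [0.6, 0.8], [0.6, 0.8]],
}
by_similarity = sorted(points, key=lambda r: -weighted_similarity(weights, query, points[r]))
by_distance = sorted(points, key=lambda r: nwd(weights, query, points[r]))
```

Because |Q_w| depends only on the query and the weights, ranking by increasing NWD gives the same order as ranking by decreasing WS; only the query side is touched at query time.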

5 Algorithms 

5.1 Basic Cluster Pruning Searching

For s = 1 the cluster pruning technique works as follows. Let E be a set of points in
d-dimensional space, and D(., .) the distance function among pairs of points. The set E is clustered
into K groups so as to minimize some functional depending on the distance. Then for each cluster a
representative point is elected. When a query point q is given, one first finds a set of k' nearest
neighbors among the representative points, for example by exhaustive search. Afterwards, only
the clusters whose representatives have been selected are searched exhaustively for the k nearest
neighbors. All other clusters are not examined, thus avoiding the computation of distances between q and
the majority of the points in E. This procedure is heuristic insofar as there is no guarantee
that all true k-nearest neighbors are found; however, in practice, by a careful choice of k, k' and
K one can detect a large fraction of the true k-nearest neighbors while accessing only a small
portion of the input data set.
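
The query procedure just described can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the data layout (a list of representative vectors and a parallel list of clusters) and the toy points on the unit circle are our own assumptions.

```python
import heapq
import math

def cos_dist(x, y):
    return 1.0 - sum(a * b for a, b in zip(x, y))

def cluster_prune_search(q, leaders, clusters, k, k_prime):
    """leaders[c] is the representative of clusters[c], a list of (id, vector)."""
    # Step 1: the k' representatives closest to the query (exhaustive over leaders).
    nearest = heapq.nsmallest(
        k_prime, range(len(leaders)), key=lambda c: cos_dist(q, leaders[c])
    )
    # Step 2: exhaustive search restricted to the selected clusters.
    candidates = {}
    for c in nearest:
        for pid, vec in clusters[c]:
            candidates[pid] = cos_dist(q, vec)
    return heapq.nsmallest(k, candidates, key=candidates.get)

def unit(deg):
    """Unit vector in the plane at the given angle, used only as toy data."""
    return [math.cos(math.radians(deg)), math.sin(math.radians(deg))]

leaders = [unit(10), unit(50), unit(90)]
clusters = [
    [("p0", unit(5)), ("p1", unit(15))],
    [("p2", unit(45)), ("p3", unit(55))],
    [("p4", unit(85)), ("p5", unit(95))],
]
one_cluster = cluster_prune_search(unit(32), leaders, clusters, k=2, k_prime=1)
two_clusters = cluster_prune_search(unit(32), leaders, clusters, k=2, k_prime=2)
```

With the query at 32 degrees the true two nearest neighbors are p2 and p1. Visiting one cluster returns p2 and p3 and misses p1, while visiting two clusters recovers both, which illustrates the recall/time tradeoff governed by k'.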

5.2 Our weight embedding scheme and algorithm 

The discussion in Section (4) shows that the pre-processing can be done independently of the
user provided weights and that any distance based clustering scheme can be used in principle.
Weights are used to modify directly the input query point and are relevant only for the query
procedure. The basic clustering algorithm we use is described in detail in [11]. It is an algorithm
based on the furthest-point-first (FPF) heuristic for the k-center problem that was proposed
in [15]. Summarizing, to produce K clusters we start by taking a sample of $\sqrt{Kn}$ points
out of the n points, and we apply the furthest-point-first method on the sample to produce K
centers. The remaining points are associated to the closest center iteratively, while adjusting
the representative point (medoid) of the cluster at each addition of a point to that cluster.
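
A simplified sketch of this preprocessing step is given below. It is not the algorithm of [11]: the medoid adjustment performed at each insertion is omitted, the first center is chosen arbitrarily, and the function names and toy data are our own assumptions.

```python
import math
import random

def fpf_centers(points, K, dist):
    """Furthest-point-first: greedily pick K centers, returned as indices into points."""
    centers = [0]  # arbitrary first center
    d_to_centers = [dist(points[0], p) for p in points]
    while len(centers) < K:
        # next center: the point furthest from all centers chosen so far
        nxt = max(range(len(points)), key=d_to_centers.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            d_to_centers[i] = min(d_to_centers[i], dist(points[nxt], p))
    return centers

def cluster_with_sampled_fpf(points, K, dist, rng):
    """Run FPF on a sample of about sqrt(K*n) points, then assign every point."""
    n = len(points)
    sample_size = min(n, max(K, math.isqrt(K * n)))
    sample = rng.sample(range(n), sample_size)
    picked = fpf_centers([points[j] for j in sample], K, dist)
    centers = [sample[i] for i in picked]
    clusters = {c: [] for c in centers}
    for i in range(n):
        closest = min(centers, key=lambda c: dist(points[c], points[i]))
        clusters[closest].append(i)
    return clusters

def abs_dist(a, b):
    return abs(a[0] - b[0])

pts = [[0], [1], [50], [52], [100], [101]]  # three well separated groups on a line
centers_full = fpf_centers(pts, 3, abs_dist)
clusters = cluster_with_sampled_fpf(pts, 3, abs_dist, random.Random(1))
```

On the full toy set FPF picks one center from each group. Running it on a sample keeps the number of distance computations in the center-selection phase proportional to K times the sample size rather than K times n.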

The new twist is that we apply [11] three times on three different random samples and we
collect all the (overlapping) clusters so produced. There is an extra overhead cost, in terms of
number of distance computations, to be paid at query time when searching multiple clusterings.
However, since each of our distance computations involves only sparse vectors (i.e. we do not use
dense centroids), each distance computation is less expensive. The balance of the two effects is
still positive for us, as demonstrated by the graphs in Figure 1.

5.3 The weight embedding scheme of [18] 

In [18] several schemes and variants are compared, but experiments show that the best
performance is consistently attained by Query Algorithm 3 (CellDec) described in [18, Section 5.4].
The preprocessing is as follows. For simplicity we consider the 3-dimensional case, that is, a
data set where each record has 3 distinct sources of evidence (e.g. in our tests, title, author and
abstract of a paper). We consider the set $T$ of positive weight-vectors summing to one (this is
the intersection of the hyperplane $w_1 + w_2 + w_3 = 1$ with the positive coordinate octant). We
split $T$ into 4 regular triangles: $T_1$, $T_2$ and $T_3$, each incident to a vertex of $T$, and the central
region $T_4$. Let $V_i^j$ be the vector corresponding to record $j$ and source $i$. Weights in the central
region $T_4$ are not too different from each other, therefore we form a composite
vector as follows: $V(T_4)^j = V_1^j + V_2^j + V_3^j$. For the other three regions we apply a squeeze factor
$\theta$ to the vector spaces corresponding to the lower weights: $V(T_1)^j = V_1^j + \theta V_2^j + \theta V_3^j$. We build
$V(T_2)^j$ and $V(T_3)^j$ similarly. Experiments in [18] show that a value of $\theta = 0.5$ attains the best
results. At query time, given the query $Q = (q, w)$, one first detects the region of $T$ containing
$w$, then uses $q$ in the associated indexing data structure for cluster-pruning.
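
The region lookup can be sketched as follows. Two points are our own assumptions and are not stated in the text: that the four triangles are obtained by joining the midpoints of the sides of T, so that a corner region is exactly the set of weight vectors with one weight above 1/2, and that the composite vector is the concatenation of the scaled per-source vectors. The function names are hypothetical.

```python
def celldec_region(w):
    """Return 1, 2 or 3 for a corner region, or 4 for the central region.

    Assumes the midpoint subdivision of the weight simplex: a weight vector
    lies in the corner region of source i exactly when w[i] exceeds 1/2.
    """
    for i, wi in enumerate(w):
        if wi > 0.5:
            return i + 1
    return 4

def squeeze_factors(region, theta=0.5):
    """Per-source scaling applied when building the index of the given region."""
    if region == 4:
        return [1.0, 1.0, 1.0]
    return [1.0 if i == region - 1 else theta for i in range(3)]

def composite_vector(fields, factors):
    """Concatenate the per-source vectors, each scaled by its factor."""
    return [f * x for f, field in zip(factors, fields) for x in field]

central = squeeze_factors(celldec_region([0.4, 0.4, 0.2]))
corner = squeeze_factors(celldec_region([0.2, 0.6, 0.2]))
```

Under these assumptions the query weights 0.4-0.4-0.2 select the index built with factors 1-1-1 and the query weights 0.2-0.6-0.2 select the one built with factors 0.5-1-0.5, which matches the pairings reported in the headers of Table 2.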



5.4 The clustering scheme of [3] 

In a forthcoming paper [3] Chierichetti et al. propose a very simple but effective scheme for
doing approximate k-nearest neighbor search for documents. In a nutshell, after mapping n
documents in a vector space they randomly choose $K = \sqrt{n}$ such documents as representatives,
and associate each other document to its closest representative. Afterwards, for each group
the centroid is computed as "leader" of the group, to be used during the search. In [3] the
authors are able to prove probabilistic bounds on the size of each group, which is an important
parameter that directly influences the time complexity of the cluster pruning search. Dynamically
weighted queries are not treated in [3], therefore we choose as a second baseline to employ [3],
in place of K-means, within the weighting framework of [18].
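
The scheme just described can be sketched as follows. This is an illustration under our own assumptions, not the code of [3]: vectors are dense lists, the centroid is left unnormalized, and the function name and the toy grid are hypothetical.

```python
import math
import random

def random_leader_clustering(points, dist, rng):
    """Pick about sqrt(n) random representatives, assign points, compute centroids."""
    n = len(points)
    K = max(1, math.isqrt(n))
    reps = rng.sample(range(n), K)
    groups = {r: [] for r in reps}
    for i in range(n):
        closest = min(reps, key=lambda r: dist(points[r], points[i]))
        groups[closest].append(i)
    leaders = {}
    for r, members in groups.items():
        if not members:  # can only happen when two representatives coincide
            continue
        dim = len(points[r])
        leaders[r] = [sum(points[i][d] for i in members) / len(members) for d in range(dim)]
    return groups, leaders

grid = [[float(x), float(y)] for x in range(4) for y in range(4)]  # 16 toy points
groups, leaders = random_leader_clustering(grid, math.dist, random.Random(7))
```

Preprocessing costs about n times sqrt(n) distance computations and involves no iterative refinement, which is why its running time is far below that of k-means.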



6 Measuring Output Quality 

We compare the results provided by the three algorithms using two quality indexes (employed
also in [18] and [3]): the mean competitive recall and the mean normalized aggregate goodness.

Mean Competitive Recall. Let $k$ be the number of similar documents we want to find (in
our experiments $k = 10$), $A(k,q,E)$ the set of the $k$ documents retrieved by algorithm $A$ on
data set $E$, and $GT(k,q,E)$ the "ground truth", i.e. the set of the $k$ points in $E$ closest to the
query $q$, which is found through an exhaustive search. The competitive recall is
$CR(A,q,k) = |A(k,q,E) \cap GT(k,q,E)|$. Note that the competitive recall is a number in the range $[0, k]$
and a higher value indicates higher quality. The Mean Competitive Recall CR is the average
of the competitive recall over a set of queries $Q$:

$$CR(A,Q,E) = \frac{1}{|Q|} \sum_{q \in Q} CR(A,q,k)$$

This measure tells us how many of the true k nearest neighbors our algorithm is able to find.
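
The recall measure translates directly into code. The sketch below uses hypothetical function names and toy document identifiers.

```python
def competitive_recall(retrieved, ground_truth):
    """Number of true k nearest neighbors that appear in the retrieved set."""
    return len(set(retrieved) & set(ground_truth))

def mean_competitive_recall(per_query):
    """per_query: list of (retrieved, ground_truth) pairs, one pair per query."""
    return sum(competitive_recall(a, gt) for a, gt in per_query) / len(per_query)

runs = [
    (["d1", "d2", "d3"], ["d1", "d2", "d4"]),  # 2 of 3 found
    (["d5", "d6", "d7"], ["d5", "d6", "d7"]),  # 3 of 3 found
]
mean_recall = mean_competitive_recall(runs)
```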

Mean Normalized Aggregate Goodness. We define as the Farthest Set $FS(k,q,E)$ the
set of $k$ points in $E$ farthest from $q$. Let the sum of distances of the $k$ furthest points from $q$
be $W(k,q,E) = \sum_{p \in FS(k,q,E)} D(q,p)$. The normalized aggregate goodness is:

$$NAG(k,q,A) = \frac{W(k,q,E) - \sum_{p \in A(k,q,E)} D(q,p)}{W(k,q,E) - \sum_{p \in GT(k,q,E)} D(q,p)}$$

Note that the Normalized Aggregate Goodness is a real number in the range $[0, 1]$ and a
higher value indicates higher quality. The Mean Normalized Aggregate Goodness NAG is the
average of the normalized aggregate goodness over a set of queries $Q$:

$$NAG(A,Q,E) = \frac{1}{|Q|} \sum_{q \in Q} NAG(A,q,k).$$

Among the possible distance functions there is a large variability in behavior (for example
some distance functions are bounded, some are not). Moreover, for a given E and q there could
be very different ranges of possible distance values. To filter out all these distortion effects
we normalize the outcome of the algorithm against the ground truth by considering the shift
against the k worst possible results. This normalization allows a finer appreciation of the
different algorithms by factoring out distance-specific idiosyncrasies and border effects.
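
The normalization can be sketched as follows. We assume here that the numerator of the normalized aggregate goodness is W(k, q, E) minus the sum of the distances of the retrieved points, which is the form that yields 1 for the ground truth and 0 for the farthest set; the function name and the toy distances are hypothetical.

```python
def normalized_aggregate_goodness(dist_to_q, retrieved, k):
    """dist_to_q maps every point id in E to D(q, p); retrieved holds the k ids returned.

    Assumed form: (W - sum over retrieved) / (W - sum over ground truth),
    where W is the sum of the k largest distances from q.
    """
    ordered = sorted(dist_to_q.values())
    best = sum(ordered[:k])    # distances of the true k nearest neighbors
    worst = sum(ordered[-k:])  # W(k, q, E): distances of the k farthest points
    got = sum(dist_to_q[p] for p in retrieved)
    return (worst - got) / (worst - best)

toy = {"a": 0.1, "b": 0.2, "c": 0.5, "d": 0.9}
perfect = normalized_aggregate_goodness(toy, ["a", "b"], 2)
mixed = normalized_aggregate_goodness(toy, ["a", "c"], 2)
worst_case = normalized_aggregate_goodness(toy, ["c", "d"], 2)
```

A result set that misses a true neighbor but replaces it with a point that is almost as close still scores near 1, which is what distinguishes this measure from recall.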

7 Experiment 

In our experiment we compared the following algorithms: 

A) The CellDec algorithm described in [18] with k-means clustering and weighted cosine distance.

B) The algorithm proposed in [3] based on random cluster algorithm and weighted cosine 
distance, christened PODS07 for lack of a better name. 

C) The algorithm proposed here based on the furthest point first algorithm and weighted cosine 
distance (referred to as Our). 

We implemented all the algorithms in Python. Data were stored in textual bsd databases.
Tests have been run on an Intel(R) Pentium(R) D CPU 3.20GHz with 4GB of RAM, running
the Linux operating system.

Following [18] we have downloaded the first one hundred thousand Citeseer bibliographic
records^1. Each record contains three fields: paper title, authors and abstract. We built two data
sets: TS1 with the first 53722 documents and TS2 with all 100000 downloaded documents.
After applying standard stemming and stop word removal, three vector spaces were created,
one for each field of the documents. Terms in the vectors are weighted according to the standard
tf-idf scheme. Details are in Table 1.

Without loss of generality, as queries we used documents extracted from the data set. Test 
queries have been selected by picking a random set of 250 documents. During searches the 
exact match of the query document is not counted. In our experiments we adopted the 7 sets 
of weights used in [18]. For each set of weights, we always used the same query set. This gave 
us the opportunity of comparing results for different choices of the weights vector. The query 
time as a function of the number of clusters visited is in Figure 1 and shows clearly the speed 
up factor of two. 



^1 http://citeseer.ist.psu.edu/
Table 1. Measure of input complexity. Preprocessing time (in hours and minutes) and storage (in Megabytes) of the
two data structures generated by CellDec, PODS07 and our algorithm.

Fig. 1. Average query time (in seconds) over all queries as a function of the number of visited clusters.



8 Conclusions and future work 

We have shown that a difficult searching problem with dynamically chosen weights can be
reduced, thanks to the linearity properties of the cosine similarity metric, to a simpler static
search problem. For this problem we provide efficient and effective methods that are competitive
with state of the art techniques for large semi-structured textual databases. We plan in future
work to extend and test our techniques on other types of data (e.g. images and
sound). We wish to thank P. Raghavan for introducing us to cluster pruning techniques and A.
Panconesi for many useful discussions and for providing a preprint of [3].
0.905     0.867     0.915
0.914     0.874     0.921
0.919     0.881     0.925
0.924     0.887     0.927




Query weights 0.4-0.4-0.2 - CellDec weights 1-1-1

Recall
CellDec   PODS07    Our
4.812     4.512     6.128
5.184     5.032     7.168
5.336     5.196     7.64
5.472     5.284     7.832
5.544     5.372     7.916
5.644     5.444     7.984
5.492     4.852     6.848
5.536     5.18      7.708
5.704     5.368     8.08
5.776     5.444     8.268
5.86      5.528     8.392
5.904     5.6       8.448
5.972     5.652     8.48

NAG
CellDec   PODS07    Our
0.769     0.743     0.778
0.811     0.807     0.833
0.830     0.821     0.856
0.844     0.832     0.869
0.855     0.842     0.872
0.866     0.853     0.875
0.852     0.771     0.836
0.869     0.819     0.883
0.884     0.843     0.903
0.899     0.860     0.909
0.908     0.867     0.916
0.914     0.875     0.919
0.918     0.881     0.921




Query weights 0.2-0.4-0.4 - CellDec weights 1-1-1

Recall
CellDec   PODS07    Our
3.864     3.772     6.356
4.06      4.168     7.116
4.148     4.284     7.516
4.284     4.3       7.624
4.312     4.284     7.704
4.404     4.328     7.76
4.78      4.0       6.96
4.692     4.2       7.708
4.796     4.344     8.004
4.916     4.428     8.076
4.956     4.5       8.184
5.004     4.552     8.24
5.072     4.58      8.268

NAG
CellDec   PODS07    Our
0.698     0.679     0.762
0.737     0.738     0.807
0.756     0.753     0.827
0.774     0.763     0.836
0.786     0.772     0.840
0.797     0.783     0.842
0.798     0.725     0.819
0.811     0.763     0.870
0.828     0.785     0.883
0.847     0.800     0.887
0.857     0.808     0.896
0.863     0.817     0.898
0.868     0.825     0.900




Query weights 0.4-0.2-0.4 - CellDec weights 1-1-1

Recall
CellDec   PODS07    Our
4.0       3.752     5.608
4.176     4.104     7.048
4.292     4.188     7.664
4.312     4.256     7.932
4.324     4.204     8.096
4.352     4.244     8.176
4.388     3.792     5.988
4.396     4.172     7.272
4.444     4.284     7.82
4.4       4.312     8.136
4.412     4.312     8.44
4.444     4.308     8.516
4.456     4.296     8.608

NAG
CellDec   PODS07    Our
0.791     0.757     0.786
0.830     0.815     0.869
0.845     0.828     0.901
0.851     0.839     0.916
0.858     0.849     0.922
0.865     0.856     0.926
0.834     0.762     0.817
0.855     0.813     0.895
0.863     0.834     0.924
0.868     0.848     0.934
0.874     0.852     0.943
0.877     0.855     0.946
0.880     0.858     0.949




Query weights 0.2-0.6-0.2 - CellDec weights 0.5-1-0.5

Recall
CellDec   PODS07    Our
4.084     3.496     6.392
4.18      3.684     7.008
4.236     3.848     7.22
4.312     3.932     7.344
4.388     3.964     7.4
4.428     4.04      7.448
4.548     4.112     7.024
4.696     4.252     7.632
4.74      4.308     7.824
4.74      4.444     7.976
4.792     4.492     8.028
4.828     4.516     8.056
4.848     4.54      8.08

NAG
CellDec   PODS07    Our
0.770     0.668     0.740
0.802     0.702     0.775
0.818     0.734     0.788
0.828     0.751     0.799
0.842     0.762     0.801
0.848     0.769     0.805
0.870     0.770     0.814
0.900     0.795     0.849
0.914     0.816     0.861
0.923     0.835     0.867
0.927     0.844     0.873
0.928     0.852     0.876
0.930     0.859     0.878




Query weights 0.6-0.2-0.2 - CellDec weights 1-0.5-0.5

Recall
CellDec   PODS07    Our
3.172     2.716     5.76
3.308     3.14      7.236
3.376     3.216     7.848
3.396     3.292     8.156
3.424     3.336     8.32
3.44      3.36      8.412
3.632     3.044     5.808
3.944     3.44      7.132
3.968     3.62      7.728
4.012     3.736     8.128
4.0       3.824     8.32
4.024     3.876     8.488
4.016     3.884     8.632

NAG
CellDec   PODS07    Our
0.809     0.725     0.795
0.845     0.793     0.883
0.861     0.823     0.913
0.867     0.839     0.930
0.870     0.849     0.936
0.874     0.856     0.939
0.803     0.702     0.812
0.852     0.784     0.891
0.861     0.823     0.921
0.865     0.836     0.936
0.869     0.849     0.945
0.874     0.860     0.953
0.874     0.862     0.957




Query weights 0.2-0.2-0.6 - CellDec weights 0.5-0.5-1

Recall
CellDec   PODS07    Our
3.384     3.168     [illegible]
3.532     3.436     7.108
3.64      3.604     7.728
3.736     3.7       7.92
3.832     3.74      [illegible]
3.892     3.764     8.161
4.176     3.584     [illegible]
4.312     3.876     [illegible]
4.424     3.996     [illegible]
4.48      4.08      [illegible]
4.5       4.148     [illegible]
4.508     4.244     8.1
4.556     4.292     8.52

NAG
CellDec   PODS07    Our
0.773     0.737     0.773
0.806     0.785     0.859
0.828     0.812     0.887
0.840     0.825     0.896
0.856     0.835     0.902
0.866     0.840     0.908
0.853     0.755     0.837
0.869     0.807     0.889
0.884     0.834     0.914
0.889     0.845     0.923
0.894     0.852     0.933
0.896     0.862     0.936
0.903     0.866     0.939



Table 2. Quality results of the compared algorithms. Recall is a number in [0,10], Normalized Aggregated Goodness is
a number in [0,1]. Data as a function of the number of visited clusters.



